ABSTRACT


The  stack  decoder  is  an  attractive  algorithm  for  controlling  the  acoustic  and 
language model matching in a continuous speech recognizer. It implements a best-first
tree search to find the best match to both the language model and the observed
speech. A previous paper described a near-optimal admissible Viterbi A* search
algorithm  for  use  with  non-cross-word  acoustic  models  and  no-grammar  language 
models  [1].  This  report  extends  this  algorithm  to  include  unigram  language  models 
and  describes  a  modified  version  of  the  algorithm  which  includes  the  full  (forward) 
decoder, cross-word acoustic models and longer-span language models. The resultant
algorithm is not admissible, but has been demonstrated to be very efficient.

1. INTRODUCTION


Speech  recognition  may  be  treated  as  a  tree  network  search  problem.  As  one  proceeds  from 
the  root  toward  the  leaves,  the  branches  leaving  each  junction  represent  the  set  of  words  that  may  be 
appended  to  the  current  partial  sentence.  Each  of  the  branches  leaving  a  junction  has  a  probability 
and each word has a likelihood of being produced by the observed acoustic data. The recognition
problem is to identify the most likely path (word sequence, W*) from the root (beginning of the
sentence) to a leaf (end of the sentence) taking into account the junction probabilities (the stochastic
language model) p(W) and the acoustic match (including time alignment) p(O|W) given that path
[2]


$W^* = \arg\max_{W} \, p(O \mid W)\, p(W)$  (1)

where O is the acoustic observation sequence and W is a word sequence. Similarly, for no-grammar
language model recognition, the problem is to identify the most likely word sequence given only
the acoustic data

$W^* = \arg\max_{W} \, p(O \mid W)$.  (2)

By Bayes' rule, likelihoods can be substituted for the probabilities in either of the above equations
without changing the recognized sentence W*.
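
As a brief check on Equation (1), offered here as a sketch rather than as part of the original derivation: the most likely sentence maximizes the posterior p(W|O), and the normalizing term p(O) does not depend on W, so it can be dropped from the maximization.

```latex
W^* = \arg\max_{W} \, p(W \mid O)
    = \arg\max_{W} \, \frac{p(O \mid W)\, p(W)}{p(O)}
    = \arg\max_{W} \, p(O \mid W)\, p(W)
```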

This  report  is  concerned  with  the  network  search  problem  and  therefore  correct  recognition  is 
defined as outputting the most likely sentence W* given the language model, the acoustic models,
and  the  observed  acoustic  data.  If  the  most  likely  sentence  is  not  the  one  spoken,  it  is  a  modeling 
error — not  a  search  error.  This  paper  will  assume  for  simplicity  that  an  isolated  sentence  is  the 
object  to  be  recognized.  (The  algorithm  extends  trivially  to  recognize  continuous  input.) 

This  report  will  assume  a  stochastic  acoustic  model,  such  as  a  hidden  Markov  model  (HMM) 
[2,3,4], and a stochastic language model [5]. An accept-reject language model will also work: its
output log-likelihood is either zero or minus infinity.

This  report  will  initially  describe  the  basic  stack  decoder  and  the  A*  scoring  criterion.  It 
will  then  describe  a  near-optimal  admissible  search  algorithm  for  using  the  (Viterbi)  stack  decoder 
to perform continuous speech recognition (CSR) with a no-grammar/unigram language model.
Finally, it will show how to modify this algorithm to produce an efficient stack decoder that includes
the full (forward) decoder, cross-word acoustic models, and longer-span language models.

2. THE BASIC STACK DECODER


The  stack  decoder  [6],  as  used  in  speech,  is  an  implementation  of  a  best-first  tree  search.  The 
basic operation of a sentence decoder is as follows [2,7]:

1.  Initialize  the  stack  with  a  null  theory. 

2.  Pop  the  best  (highest  scoring)  theory  off  the  stack. 

3. if (end-of-sentence) output the sentence and terminate.

4.  Perform  acoustic  and  language-model  fast  matches  to  obtain  a  short  list  of  candidate 
word  extensions  of  the  theory. 

5. For each word on the candidate list:

(a)  Perform  acoustic  and  language-model  detailed  matches  to  compute  the  new 
theory  output  log-likelihood. 

i. if (not end-of-sentence) insert into the stack.

ii. if (end-of-sentence) insert into the stack with end-of-sentence flag = TRUE.

6.  Go  to  2. 

The fast matches [7,8] are computationally cheap methods for reducing the number of word extensions
that must be checked by the more accurate, but computationally expensive detailed matches.^1
(The  fast  matches  may  also  be  considered  a  predictive  component  for  the  detailed  matches.)  Top-N 
(N-best)  mode  is  achieved  by  delaying  termination  until  N  sentences  have  been  output. 
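
The control loop above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the names `stack_decode`, `fast_match`, `detailed_match`, `is_end`, and the toy table `LOGP` are all hypothetical, the "acoustic" scores are made up, and the raw log-likelihood is used as the stack score purely to keep the example small.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Theory = Tuple[str, ...]  # word history of a partial sentence


def stack_decode(
    fast_match: Callable[[Theory], Iterable[str]],
    detailed_match: Callable[[float, Theory, str], float],
    is_end: Callable[[Theory], bool],
    n_best: int = 1,
) -> List[Tuple[float, Theory]]:
    """Best-first tree search; scores are log-likelihoods (higher is better)."""
    # heapq is a min-heap, so scores are stored negated.
    stack: List[Tuple[float, Theory]] = [(0.0, ())]      # step 1: null theory
    results: List[Tuple[float, Theory]] = []
    while stack and len(results) < n_best:
        neg_score, theory = heapq.heappop(stack)          # step 2: pop the best
        if theory and is_end(theory):                     # step 3: output
            results.append((-neg_score, theory))
            continue                                      # top-N: keep going
        for word in fast_match(theory):                   # step 4: short list
            score = detailed_match(-neg_score, theory, word)   # step 5(a)
            heapq.heappush(stack, (-score, theory + (word,)))  # steps 5(a)i-ii
    return results


# Toy models (assumptions for illustration only).
LOGP = {"a": -1.0, "b": -2.0, "</s>": -0.5}


def fast_match(theory: Theory) -> List[str]:
    return [] if len(theory) >= 3 else list(LOGP)


def detailed_match(parent_score: float, theory: Theory, word: str) -> float:
    return parent_score + LOGP[word]


def is_end(theory: Theory) -> bool:
    return theory[-1] == "</s>"


top2 = stack_decode(fast_match, detailed_match, is_end, n_best=2)
```

In this sketch the end-of-sentence flag is implied by the last word of the theory; a fuller version would carry it in the stack entry.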

The  stack  itself  is  just  a  sorted  list  that  supports  the  following  operations:  pop  the  best  entry 
and  insert  new  entries  according  to  their  scores.  The  following  items  must  be  contained  in  the  ith 
stack  entry: 

1. A stack score: StSc_i

2. A reference time: t_ref_i

3.  A  word  history  (path  or  theory  identification) 

4. An output log-likelihood distribution: L_i(t)

5.  An  end-of-sentence  flag 
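
The five fields listed above map naturally onto a small record type. The following is a sketch under stated assumptions: the class name `StackEntry`, its field names, the use of integer frame indices for time, and the particular sort key (highest score first, then earliest reference time) are illustrative choices, not taken from the report.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StackEntry:
    stack_score: float                  # StSc_i
    t_ref: int                          # reference time t_ref_i (frame index)
    words: Tuple[str, ...]              # word history (theory identification)
    # Output log-likelihood L_i(t), one value per frame.
    out_loglik: List[float] = field(default_factory=list)
    end_of_sentence: bool = False

    def sort_key(self) -> Tuple[float, int]:
        # One possible ordering: best score first, ties broken by earliest t_ref.
        return (-self.stack_score, self.t_ref)


entries = [
    StackEntry(0.0, 12, ("the",)),
    StackEntry(0.0, 7, ("a",)),
    StackEntry(-3.5, 2, ("an",)),
]
ordered = [e.words for e in sorted(entries, key=StackEntry.sort_key)]
```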


^1 The following discussion concerns the basic stack decoder and therefore it will be assumed that the
correct word will always be on the fast match list. This can be guaranteed by the scheme outlined
in Bahl et al. [7]. All of the theoretical results contained in this report assume no fast matches.

The stack score and the reference time are used to sort the stack. Since the time of exiting the
last word of the theory cannot be uniquely determined without knowing the next word, the output
log-likelihood as a function of time must be contained in each entry. This distribution is the input
to the next word model. The end-of-sentence flag identifies the theories that are candidates to end
the sentence.

3. THE A* STACK CRITERION


A  key  issue  in  the  stack  decoder  is  deciding  which  theory  should  be  popped  from  the  stack 
to be extended. This is decided by the stack score and the reference time. (All scores used here
are  log-likelihoods  or  log-probabilities.)  If  one  uses  the  raw  log-probabilities  as  the  stack  score,  a 
uniform  search  [9]  will  result.  This  search  will  result  in  a  prohibitive  amount  of  computation  and 
a  very  large  stack  because  the  log-probabilities  decrease  rapidly  with  path  length  and  thus  short 
paths will be “carried along” by the better paths. A better scoring criterion is the A* criterion
[9]. The (near-optimal) A* criterion used here is the difference between the actual log-likelihood of
reaching a point in time on a path and an upper bound on the log-likelihood of any path reaching
that point in time:

$A_i(t) = L_i(t) - \mathrm{ubL}(t)$  (3)

where A_i(t) is the A* scoring function, L_i(t) is the output log-likelihood, t denotes time, i denotes
the path (tree branch or left sentence fragment) and ubL(t) is an upper bound on L_i(t). (This
criterion is similar to the exact A* criterion stated in Nilsson [9] except that, unlike the exact A*
criterion, it can be computed at negligible cost without using data from the future. See Appendix
A for a derivation.) In order to sort the stack entries, it is necessary to reduce A_i(t) to a single
number (the stack score):

$\mathrm{StSc}_i = \max_{t} A_i(t)$.  (4)

It  is  also  convenient  at  this  point  to  define  the  minimum  time  that  satisfies  Equation  (4): 

$\mathrm{t\_min}_i = \min \{\, t : \mathrm{StSc}_i = A_i(t) \,\}$.  (5)
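
Equations (3) through (5) amount to a frame-wise subtraction followed by a maximum and a first-occurrence lookup. A minimal sketch follows, assuming L_i(t) and ubL(t) are given as equal-length lists indexed by frame and that the bound is finite everywhere; the function name `stack_score` is hypothetical.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def stack_score(L_i: List[float], ubL: List[float]) -> Tuple[float, int]:
    """Return (StSc_i, t_min_i) for one theory."""
    A = [l - u for l, u in zip(L_i, ubL)]   # Equation (3)
    st_sc = max(A)                           # Equation (4)
    t_min = A.index(st_sc)                   # Equation (5): earliest maximizer
    return st_sc, t_min


ubL = [0.0, -3.0, -5.0, -9.0]
# A theory that touches the bound only at the last frame.
late = stack_score([NEG_INF, -4.0, -6.0, -9.0], ubL)
# A theory that touches the bound at frames 1 and 2; t_min is the earlier one.
early = stack_score([NEG_INF, -3.0, -5.0, -10.0], ubL)
```

Both theories receive the same stack score of zero even though they peak at different times, which is why a reference time is carried alongside the score.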

Ideally, ubL(t) would be the least upper bound on L_i(t): lubL(t). In general, the closer ubL(t) is
to lubL(t), the less computation is required. If ubL(t) becomes less than lubL(t), longer paths will be favored
excessively and the first output sentence may not have the highest log-likelihood, i.e., a search error
may occur. (Note that, at any given t, ubL(t) is the same for every path and therefore does not affect the relative scores
of the paths at any fixed time; it only affects the comparison of paths of differing lengths and the
resultant order of path expansion.)

The stack search will be admissible [9] if the following conditions are met:

$\mathrm{ubL}(t) \ge \mathrm{lubL}(t)$  (6)

and

$\mathrm{ubL}(t_2) - \mathrm{lubL}(t_2) \ge \mathrm{ubL}(t_1) - \mathrm{lubL}(t_1), \quad t_2 > t_1$  (7)

and it will be near-optimal [9] in the sense that a near-minimum number of theories will need to
be expanded for an admissible search if

$\mathrm{ubL}(t) = \mathrm{lubL}(t)$.  (8)

The  additional  condition  stated  in  Equation  (7)  over  those  found  in  Nilsson  [9]  is  sufficient  (but 
not  necessary)  because  the  word  acoustic  models  can  “jump”  over  parts  of  the  bound  and  therefore 
a  locally  poor  bound  can  block  the  correct  theory  while  passing  those  that  are  able  to  jump  over 
the  poor  region. 

A basic problem is obtaining a good estimate of ubL(t) in a time-asynchronous decoder.
(Note that lubL_state(t) over the states is easily computed in a time-synchronous decoder and that
A_state(t) is the value compared to the pruning threshold in a beam search [10].) One simple estimate
of lubL(t) is


$\mathrm{lubL}(t) = -\alpha t$,  (9)

where α is some constant greater than zero. This approach attempts to cancel out the average
log-likelihood per time step. If α is too large, it will underestimate the bound and risk recognition
errors. If α is small enough, the search will be admissible but will require an excessive amount
of computation. (In fact, α = 0 is the uniform search mentioned above.) Unfortunately no single
value of α is optimum for all sentences or all parts of a single sentence. Thus a conservative value
must be chosen and the computation will be excessive.

This estimate of the bound is not useful for controlling the stack decoder, but it is adequate to
estimate the most-probable theory exit time. Given an appropriate value for α, A_i(t) will exhibit
a peak whose location is an estimate of the most-probable exit time of the theory. (This stack
decoder only implements the forward decoder; finding the exact most-probable exit time requires
information from the decode of some amount of the remainder of the sentence.) Therefore,
substituting the bound of Equation (9) into Equation (3), the estimated exit time is


$\mathrm{t\_exit}_i = \arg\max_{t} \, [\, L_i(t) + \alpha t \,]$.  (10)
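
The exit-time estimate can be sketched as follows. Because the linear bound of Equation (9) is -αt, subtracting it in Equation (3) adds αt to L_i(t), and the estimate is the location of the resulting peak. The function name `estimated_exit_time` and the sample values are assumptions for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def estimated_exit_time(L_i: List[float], alpha: float) -> int:
    """Frame at which L_i(t) - (-alpha * t) peaks (earliest frame on ties)."""
    scores = [l + alpha * t for t, l in enumerate(L_i)]
    return max(range(len(scores)), key=scores.__getitem__)


L_i = [NEG_INF, -2.0, -3.0, -3.5, -6.0, -9.0]
t_exit = estimated_exit_time(L_i, alpha=1.0)
t_uniform = estimated_exit_time(L_i, alpha=0.0)  # alpha = 0: raw log-likelihood
```

With α = 0 the peak collapses onto the earliest frame with a finite likelihood, which illustrates why the raw log-likelihood favors short paths.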
4. THE NEAR-OPTIMAL A* STACK DECODER FOR RECOGNITION
WITH A NO-GRAMMAR OR UNIGRAM LANGUAGE MODEL


It is not possible to compute the exact least upper bound on the theory likelihoods without
first performing the recognition. It is, however, possible to compute the least-upper-bound-so-far
(lubsf) on the likelihoods that have already been computed, which requires negligible computation
and is sufficient to perform the near-optimal A* search. This creates two problems:

1. Since lubL(t) = lubsfL(t) can change as the theories are evaluated, the stack order
can also change.

2. A degeneracy in determining the best path by StSc alone can occur because lubsfL(t)
can equal L_i(t) for more than one i (path) at different times.

Problem 1 is easily cured by reevaluating the stack score StSc every time lubsfL(t) is updated
and reorganizing the stack. This is easily accomplished if the stack is stored as a heap [11].

Problem 2 occurs because different theories may dominate different parts of the current upper
bound. Thus all of these theories will have a score of zero. If the longest theory is extended, its
descendants will in turn dominate the longest part of the bound and will therefore be extended.
This will, of course, result in search errors because the shorter theories will never have a chance to
be extended. The cure is to extend the shortest theory (minimum t_min) that has a stack score
equal to the best. If t_ref_i = t_min_i, this can be accomplished by performing a major sort on the
stack score StSc and a minor sort on the reference time t_ref.
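
The two cures can be sketched together. This is a toy illustration under stated assumptions: the class `LubsfStack` and its methods are hypothetical names, the likelihood values are invented, and every entry is re-scored by a linear scan on each pop for clarity, where a heap would be used in practice.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")
Words = Tuple[str, ...]


class LubsfStack:
    """Toy stack re-scored against the least-upper-bound-so-far (lubsf)."""

    def __init__(self, n_frames: int) -> None:
        self.lubsf: List[float] = [NEG_INF] * n_frames
        self.theories: Dict[Words, List[float]] = {}

    def insert(self, words: Words, L_i: List[float]) -> None:
        self.theories[words] = L_i
        # The bound only ever rises: frame-wise maximum of everything seen.
        self.lubsf = [max(b, l) for b, l in zip(self.lubsf, L_i)]

    def score(self, words: Words) -> Tuple[float, int]:
        """Return (StSc, t_min) under the current lubsf."""
        best, t_min = NEG_INF, -1
        for t, (l, b) in enumerate(zip(self.theories[words], self.lubsf)):
            if b == NEG_INF:          # no theory has reached this frame yet
                continue
            a = l - b                 # Equation (3) with ubL(t) = lubsfL(t)
            if a > best:              # strict '>' keeps the earliest maximizer
                best, t_min = a, t
        return best, t_min

    def pop(self) -> Words:
        # Problem 1: re-score every entry against the current bound.
        # Problem 2: major sort on StSc (high first), minor sort on t_min.
        def key(words: Words) -> Tuple[float, int]:
            st_sc, t_min = self.score(words)
            return (-st_sc, t_min)

        best = min(self.theories, key=key)
        del self.theories[best]
        return best


stack = LubsfStack(n_frames=6)
stack.insert(("a",), [NEG_INF, -1.0, -2.0, -4.0, NEG_INF, NEG_INF])
stack.insert(("b",), [NEG_INF, -2.0, -3.0, -3.5, -5.0, NEG_INF])
scores = {w: stack.score(w) for w in stack.theories}
first = stack.pop()
```

Both toy theories dominate part of the bound and so both score zero; the minor sort on the reference time selects the shorter one first.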

This guarantees that lubsfL(t) = lubL(t) for t < t_ref_p (where p denotes the theory that is
about to be popped) and therefore the relevant part of the least-upper-bound has been computed
by the time that it is needed. Since the bound, at the time that it is needed, is the
least-upper-bound, the search is admissible and near-optimal. Furthermore, when the first sentence is output,
the least-upper-bound-so-far will be the exact least-upper-bound. (This assumes no fast match was
used. An aggressive fast match can cause some portions of the lubsf to underestimate the exact
lub but this will not destroy the admissibility of the search as long as the correct word is on the
fast match list.)

A  stack  pruning  threshold  can  be  used  to  limit  the  stack  size  [1].  Any  theory  whose  StSc  falls 
below the threshold can be deleted from the stack. This can be applied on stack insertions and any
time  the  stack  is  reorganized.  (Any  update  to  the  lubsf  will  only  cause  the  StSc  to  be  reduced,  so 
any  theory  that  is  pruned  will  not  be  accepted  by  the  pruning  threshold  at  a  later  portion  of  the 
search.)  This  stack  pruning  threshold  has  little  effect  on  the  computational  requirements  and  can 
therefore  be  set  very  conservatively  to  essentially  eliminate  any  chance  that  the  correct  theory  will 
be  pruned. 
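
A small sketch of the pruning rule, assuming finite likelihood values, a threshold expressed as a non-positive log-likelihood offset, and a hypothetical helper name `rescore_and_prune`:

```python
from typing import Dict, List, Tuple

Words = Tuple[str, ...]


def rescore_and_prune(
    theories: Dict[Words, List[float]],
    lubsf: List[float],
    threshold: float,
) -> Dict[Words, float]:
    """Re-score each theory against lubsf; keep those with StSc >= threshold."""
    kept: Dict[Words, float] = {}
    for words, L_i in theories.items():
        st_sc = max(l - b for l, b in zip(L_i, lubsf))   # Equations (3), (4)
        if st_sc >= threshold:
            kept[words] = st_sc
    return kept


theories = {
    ("x",): [-1.0, -2.0, -4.0],
    ("y",): [-3.0, -2.5, -3.0],
    ("z",): [-6.0, -7.0, -8.0],
}
lubsf = [-1.0, -2.0, -3.0]          # frame-wise maximum of the three theories
tight = rescore_and_prune(theories, lubsf, threshold=-4.0)
loose = rescore_and_prune(theories, lubsf, threshold=-10.0)
```

Because a later update can only raise lubsf, a re-scored StSc can only fall, so a theory dropped here would not have passed the same threshold later.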

In  a  time-synchronous  (TS)  no-grammar/unigram  language  model  Viterbi  decoder,  all  word 
output  likelihoods  are  compared  and  only  the  maximum  is  passed  on  as  input  to  the  word  models. 
Thus by comparison, only theories that dominate the lubsf need be retained on the stack and the
stack pruning threshold can be set to zero for top-1 recognition. Since all stack scores, StSc, of all
theories popped from the stack will be zero until the first sentence is output, all theories popped
from the stack will be in reference time t_min order. (Of course, the stack pruning threshold must
be nonzero if a top-N list of sentences is desired.) For top-N recognition, this algorithm adaptively
raises  the  effective  computational  pruning  threshold  (which  equals  the  current  best  StSc)  by  the 
minimum  required  to  produce  N  output  sentences,  subject  to  the  limit  placed  by  the  stack  pruning 
threshold. 

A  time-synchronous  Viterbi  decoder  must  sometimes  make  an  arbitrary  decision  when  two 
output  likelihoods  are  equal,  as  can  be  caused  by  homonyms  with  identical  acoustic  models.  The 
top-1 stack decoder will maintain both theories and, if all theories with likelihood equal to that
of  the  best  theory  are  output,  all  homonym  variations  will  be  output.  (This  does  not  require  a 
nonzero  stack  pruning  threshold  because  both  theories  will  have  the  same  likelihood.) 

As is shown in Appendix A, this algorithm is near-optimal and admissible only for a Viterbi
decode using non-cross-word acoustic models and a no-grammar or unigram language model. In a
full (forward) decode, the output likelihood of one theory (which may later have a higher likelihood
due to summing of the probabilities of paths with different time alignments) may be “shadowed”
by the output likelihood of a second theory. The first theory would therefore not be expanded. In
general,  the  recognition  accuracy  of  the  Viterbi  decoder  is  very  similar  to  the  accuracy  of  the  full 
decoder.  (Small  differences  have  been  observed  in  some  experiments.) 

The long-span language model flaw in this algorithm can similarly cause shadowing of the
correct theory [12]. A language model can improve the relative likelihood of a theory such that a
currently shadowed theory can dominate the bound after expansion. The above algorithm will not
allow the shadowed theory to be expanded until all better (higher stack score) theories have been
expanded. Such will occur if the system is run in top-N mode for a large enough N. Similarly an
end-of-sentence penalty can force a shadowed theory to be expanded. Unfortunately both of these
methods greatly increase the computation.

5. AN EFFICIENT STACK DECODER ALGORITHM FOR USE WITH A
MULTIPLE-WORD SPAN LANGUAGE MODEL


An efficient stack decoder algorithm that can be used with cross-word acoustic models, the
full (forward) decoder, and longer-span (≥ 2) language models can be produced by two simple
changes:

1. Change the stack ordering to be a major sort on the reference time t_ref (favoring
the lesser times) and a minor sort on the stack score StSc.

2.  Use  a  non-zero  stack  pruning  threshold. 

The reference time t_ref may also be changed from the minimum time that satisfies Equation (4)
used in the no-grammar/unigram language-model version to t_exit as defined in Equation (10).
(Either  will  work  and  both  required  similar  amounts  of  computation  in  tests.)  This  algorithm 
appears  to  be  a  simplification  of  one  developed  at  IBM  [13].  This  algorithm  is  not  admissible 
because  the  correct  theory  can  be  pruned  from  the  stack.  The  stack-pruning  threshold  now  controls 
the  trade-off  between  the  amount  of  computation  and  the  probability  of  pruning  the  correct  theory 
by  controlling  the  likelihood  “depth”  that  will  be  searched.  Unlike  the  previous  algorithm,  an 
(unpruned)  theory  will  not  be  shadowed  because  it  will  be  extended  when  its  reference  time  is 
reached. This algorithm is quasi-time-synchronous because it, in effect, moves a time bound forward
and  whenever  this  time  bound  becomes  equal  to  the  reference  time  of  a  theory,  the  theory  is 
expanded.  (Descendants  of  the  theory  will  generally  have  reference  times  farther  forward  in  time.) 
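
The changed ordering can be sketched as a sort key. In this illustration the function name `pop_order`, the tuple layout, and the sample entries are assumptions; the threshold is taken to be a negative log-likelihood offset below which theories are dropped.

```python
from typing import List, Tuple

Entry = Tuple[int, float, Tuple[str, ...]]   # (t_ref, StSc, word history)


def pop_order(entries: List[Entry], threshold: float) -> List[Entry]:
    """Order in which surviving theories would be popped."""
    live = [e for e in entries if e[1] >= threshold]   # non-zero pruning depth
    # Major sort on t_ref (earliest first), minor sort on StSc (best first).
    return sorted(live, key=lambda e: (e[0], -e[1]))


entries = [
    (30, 0.0, ("show", "me")),
    (12, -2.0, ("so",)),
    (12, -0.5, ("show",)),
    (20, -9.0, ("shoe",)),
]
order = [words for _, _, words in pop_order(entries, threshold=-5.0)]
```

Note that the theory with the best stack score is popped last here because its reference time is the latest, while a lower-scoring theory inside the threshold is still expanded when the time bound reaches it.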

Note that the stack pruning threshold can also be set to zero for no-grammar/unigram
language model top-1 recognition with this algorithm. With a zero stack pruning threshold and
t_ref_i = t_min_i, it becomes equivalent to the near-optimal, admissible no-grammar/unigram
language model algorithm described above for top-1 recognition. (While this algorithm can also
perform top-N recognition with or without a language model, it cannot be made equivalent to the
no-grammar/unigram language model version for top-N. Its pruning threshold is fixed and it will
only output theories whose relative likelihoods do not fall below the threshold.)

Given  that  this  stack  decoder  algorithm  has  become  quasi-time-synchronous,  what  are  its 
advantages  over  a  time  synchronous  system? 

1.  It  efficiently  combines  a  long-span  language  model  with  the  acoustic  recognition  into 
the  search. 

2. It is an effective control strategy for simultaneous interpretation of large language
and  acoustic  models. 

3. It has a tighter pruning threshold than a TS system because the threshold is applied
only to word end likelihoods after the language model has been applied rather than
to all states (which includes word internal states as well as states immediately before
and after application of the language model).

4. It may require fewer applications of the fast match since the fast match need only
be applied once per popped theory rather than once per time step. (This can be
guaranteed by an appropriately chosen fast match algorithm.)

5. It produces a top-N sentence output with only negligible modification to the
algorithm.

6. It can perform a Viterbi decode or an exact full decode without difficulty. The
likelihood “depth” of the search combined with the time-based major (stack) sort
allows pragmatically admissible operation of the full decode.

6. DISCUSSION AND CONCLUSIONS


The  stack-search  algorithms  discussed  in  this  report  have  been  implemented  in  a  prototype 
that  uses  real  speech  input,  but  does  not  yet  have  all  of  the  features  of  the  Lincoln  TS  CSR 
[14,15].  (The  primary  missing  features  are  cross-word  phonetic  modeling  and  tied-mixtures.)  The 
prototype  runs  faster  than  does  the  TS  system  on  the  corresponding  recognition  task,  frequently  by 
a  significant  factor.  (In  fairness,  the  TS  system  does  not  include  a  fast  match.)  Current  experience 
using  the  DARPA  Resource  Management  Database  [16]  shows  the  required  number  of  stack  pops 
and the stack size to be surprisingly small. In addition, the prototype includes a proposed CSR-NL
interface [17] and has been run with unigram, word-pair, bigram, and trigram language models
accessed  through  the  interface  without  difficulty.  (It  has  also  been  run  using  a  no-grammar  language 
model,  which,  of  course,  does  not  require  the  interface.) 

Methods for joining the acoustic matching of separate theories and caching of acoustic
computations to reduce the acoustic match computation were described in Paul [1]. These algorithms
were tested in a stack-decoder simulator (real stack decoder with simulated input data). These
accelerators have not been tested in the prototype stack decoder, but there is no reason in principle
why  they  could  not  be  used. 

A* search using the scoring function described by Nilsson [9] [Equation (A.1)] requires
computing the likelihood of the future data [h*(t) in Equation (A.4)]. The optimal A* decoder requires
exact evaluation of h*(t), which requires solving the top-1 recognition problem by some other
means, such as a reverse direction TS decoder [18], before the A* search can begin. The alternative
described here substitutes a near-optimal scoring function that is derived from the A* search
and requires negligible additional computation over that required by the search itself. Since, as
noted above, the Lincoln top-1 TS decoder takes more CPU time than does the near-optimal stack
decoder, the near-optimal stack decoder algorithm appears to be the most efficient of the three
approaches for top-1 recognition. In addition, the inadmissible version of the near-optimal stack
decoder can very easily integrate long-span language models into the search. However, if top-N
recognition is the goal, the optimal A* search may be preferred because, once the price is paid
for computing h*(t), the A* search can find the additional N-1 sentences very efficiently for
no-grammar/unigram language models [18]. A longer span language model would require computing
h_i*(t) [Equation (A.1)] rather than h*(t) [Equation (A.4)], which would increase the cost of computing
the additional sentences.

Recently several other algorithms have been proposed for top-N recognition using A* search
[19,18,20] that use the Nilsson formulation of the scoring function. All of these approaches use a
reverse direction TS decoder to compute h*(t). (There are also some proposed non-A* methods
for recognizing the top-N sentences [21,22,23]. In general, the bidirectional approaches appear
to be more efficient than the unidirectional approaches.) These A* (and bidirectional) methods
must wait for the end of data (or a pseudo-end-of-data) to begin the A* (or the reverse direction)
pass. In contrast, because they do not need data beyond that necessary to extend the current
theory (this includes data up to t_ref required to choose the current theory), the two stack decoder
formulations proposed here can proceed totally left-to-right as the input data becomes available
from the front end. The multiple-word-span language-model version of the stack search will output
all top-N theories with minimal delay following the end-of-data because all theories are pursued in
quasi-parallel or, in top-1 mode, it can output the partial sentence as soon as all unpruned theories
have a common partial history (initial word sequence). (A similar technique for continuous output
after a short delay from continuous input exists for TS decoders [24].)

One of the motivations for some of these other A* (and top-N) algorithms is as a method for
using  weaker  and  cheaper  initial  acoustic  and  language  models  to  produce  a  top-N  sentence  list  for 
later  refinement  by  more  detailed  and  expensive  acoustic  and/or  language  models,  which  now  need 
only  consider  a  few  theories.  In  contrast  the  algorithm  proposed  here  integrates  both  the  detailed 
acoustic  and  language  models  directly  in  the  stack  search  and  therefore  need  only  produce  a  top-1 
output.  It  attempts  to  minimize  the  computation  by  applying  all  available  information  to  constrain 
the  search.  (The  stack  decoder  as  described  here  can,  of  course,  also  be  used  with  weak  and  cheap 
acoustic  and/or  language  models  to  produce  a  top-N  list  for  later  processing.)  The  ultimate  choice 
between the two methods may be determined by the number of sentences required by the top-N
approaches and the relative computational costs of the various modules in each system. The
architectural  simplicity  of  each  system  may  also  have  some  bearing. 

The  stack  decoder  has  long  shown  promise  for  integrating  long-span  language  models  and 
acoustic  models  into  a  single  effective  search  which  applies  information  from  both  sources  into 
controlling the search. It has not been used at many sites, primarily due to the difficulty of making
the  search  efficient.  The  algorithms  described  above  will  hopefully  remove  this  barrier. 




APPENDIX A

DERIVATION OF THE A* CRITERION USED IN EQUATION (3)


Nilsson [9] states the A* criterion (slightly rewritten to match the speech recognition problem)
as


$f_i(t) = g_i(t) + h_i^*(t)$  (A.1)

where f_i(t) is the log-likelihood of a sentence with the partial theory i ending at time t, g_i(t) is
the log-likelihood of partial theory i, and h_i*(t) is the log-likelihood of the best extension of theory
i from time t to the end of the data. (Nilsson uses costs which are interpreted here as negative
log-likelihoods. All descriptions here will use sign conventions appropriate for log-likelihoods to be
consistent with the rest of the paper.) The theory $\arg\max_i (\max_t f_i(t))$ is chosen as the next to be
popped from the stack and expanded.

Equation  (A.l)  requires  that  the  computation  of  the  total  likelihood  of  a  sentence  must  be 
separable  into  a  beginning  part  and  an  end  part  separated  by  a  single  time,  which  disallows  this 
derivation  for  the  full  (forward)  decoder  because  the  full  decoder  does  not  have  a  unique  transition 
time between two words. Thus, the derivation is limited to a decoder which is Viterbi between
words. To allow the h*(t) terms to cancel in Equation (A.5), define h_i*(t) to be independent of the
theory


$h_i^*(t) = h^*(t)$.  (A.2)

Therefore  Equation  (A.l)  can  be  rewritten  as 

$f_i(t) = g_i(t) + h^*(t)$.  (A.3)

This requirement limits the derivation to non-cross-word acoustic model and no-grammar or
unigram language model recognition tasks.

Define 


$f^*(t) = g^*(t) + h^*(t)$  (A.4)

for the best theory with a word transition at time t. The function f*(t) is slowly varying with global
maxima at the word transition points of the correct theory, at which points it equals the likelihood
of the correct theory. Specifically, it is maximum at t = 0 and t = T. (T is the end of data.) Since
g_i(t) is an exact value (rather than a bound or estimate) for a tree search, $g^*(t) = \mathrm{lub}_i \, g_i(t)$ and
since h*(t) is not a function of i, $f^*(t) = \mathrm{lub}_i \, f_i(t)$.

Subtract Equation (A.4) from Equation (A.3) and define $\hat{f}_i(t)$:


$\hat{f}_i(t) = f_i(t) - f^*(t) = g_i(t) - g^*(t)$.  (A.5)

This is just Equation (3) in a different notation: g_i(t) = L_i(t) and g*(t) = ubL(t) (specifically
lubL(t)) and therefore $\hat{f}_i(t) = A_i(t)$. Thus, if f*(t) were a constant, $\hat{f}_i(t)$ would just be an offset
from f_i(t) and the search would be optimum because $\arg\max_i (\max_t \hat{f}_i(t))$ would always be equal
to $\arg\max_i (\max_t f_i(t))$. As noted earlier, f*(t) has maxima at word transition times of the correct
theory. Thus $\hat{f}_i(t)$ is zero at word transition times on the correct theory and ≤ 0 for all other
i and t. Thus the search is admissible because it can never block the correct theory by giving a
better score to an incorrect theory, but sub-optimal because it can cause incorrect theories to be
popped from the stack and be evaluated. Since the evaluation function “error” f*(t) - f*(0) is
slowly varying, the search is near-optimal.

Since the stack decoder treats each theory and all points on the likelihood distribution L_i(t)
as a unit, each theory is evaluated at its optimum point, the max over t of A_i(t) as defined in Equation (4),
to give it its “best” chance and then, for efficiency, the likelihood of all points on the distribution
L_i(t) are extended in one operation.

The fact that all StSc_i are zero until the first sentence is output and the tie is broken by
choosing the theory with the minimum reference time t_min ensures that all candidate theories
that might alter lubsfL(t < t_min_p) have already been computed. Thus lubsfL(t) = lubL(t)
for t < t_min_p.

This derivation shows the stack criterion max StSc_i with a minimum t_min_i tie-breaker to be
adequate to perform a near-optimal admissible A*-search Viterbi-recognition with a no-grammar
or unigram language-model using a stack decoder algorithm.